Refactor XPath evaluation and optimize memory allocation #1

Merged: jeffhuen merged 3 commits into main from claude/fix-memory-leaks-rust-bCi0M on Feb 16, 2026

@jeffhuen
Owner

Summary

This PR refactors XPath evaluation logic, consolidates duplicate code, and optimizes memory allocation patterns throughout the codebase. The changes improve correctness of XPath comparisons, reduce unnecessary cloning, and prevent unbounded memory growth in long-lived parsers and accumulators.

Key Changes

XPath Evaluation Improvements

  • Consolidated node text extraction: Moved get_node_text_content and collect_text_content logic into a shared dom::node_string_value function to eliminate duplication and ensure consistent XPath string-value semantics across the codebase
  • Fixed XPath comparison semantics: Updated compare_values to use actual node string-values instead of formatting raw node IDs as numbers, which was both semantically incorrect and wasteful
  • Added #[must_use] attributes: Applied to evaluate, evaluate_from_node, compile, and XPathValue to catch unused results at compile time
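
To illustrate the string-value semantics described above, here is a minimal sketch. It assumes a hypothetical arena-style DOM (`Doc`, `NodeKind`); the real `dom::node_string_value` signature is not shown in this PR, so the names and types below are illustrative only.

```rust
/// Hypothetical arena-backed DOM used only for this sketch.
#[derive(Debug)]
pub enum NodeKind {
    Element { name: String, children: Vec<usize> },
    Text(String),
}

pub struct Doc {
    pub nodes: Vec<NodeKind>,
}

impl Doc {
    /// XPath 1.0 string-value: for an element, the concatenation of all
    /// descendant text nodes in document order; for a text node, its content.
    pub fn node_string_value(&self, id: usize) -> String {
        let mut out = String::new();
        self.collect_text(id, &mut out);
        out
    }

    fn collect_text(&self, id: usize, out: &mut String) {
        match &self.nodes[id] {
            NodeKind::Text(t) => out.push_str(t),
            NodeKind::Element { children, .. } => {
                for &child in children {
                    self.collect_text(child, out);
                }
            }
        }
    }
}

/// Compare a node with a string by its string-value, not by its numeric ID.
pub fn node_equals_str(doc: &Doc, id: usize, rhs: &str) -> bool {
    doc.node_string_value(id) == rhs
}
```

With this shape, an element such as `<price>42<b>.5</b></price>` compares equal to `"42.5"` regardless of which node ID it happens to have.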

Memory Optimization

  • XPath expression caching: Changed cache to store Arc<CompiledExpr> instead of cloning entire compiled expressions on cache hits, so a hit now costs one atomic reference-count increment rather than a deep clone
  • Streaming parser buffer management: Added shrink_to() calls after draining buffers to prevent unbounded capacity growth in long-lived parsers
  • Event vector shrinking: Shrink events and complete_elements vectors after partial drains to release excess capacity
  • Document accumulator sizing: Reduced initial buffer capacity from 64KB to 4KB to avoid wasting memory for small documents while still allowing growth as needed
  • Index builder cleanup: Added shrink_to_fit() call after building document indices to reclaim over-allocated capacity from initial size estimates
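
The caching change can be sketched as follows. This is an assumption-laden illustration, not the crate's actual cache: `ExprCache`, the `compile` stub, and the `ops: Vec<String>` field are hypothetical stand-ins, and the real cache may use a different map type or eviction policy.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the crate's compiled expression type.
#[derive(Debug)]
pub struct CompiledExpr {
    pub ops: Vec<String>,
}

/// Placeholder "compiler": splits a path into steps (illustration only).
fn compile(expr: &str) -> CompiledExpr {
    CompiledExpr {
        ops: expr
            .split('/')
            .filter(|step| !step.is_empty())
            .map(str::to_owned)
            .collect(),
    }
}

#[derive(Default)]
pub struct ExprCache {
    entries: Mutex<HashMap<String, Arc<CompiledExpr>>>,
}

impl ExprCache {
    /// Returns a shared handle. A cache hit only increments the Arc's
    /// reference count; the compiled expression itself is never cloned.
    pub fn get_or_compile(&self, expr: &str) -> Arc<CompiledExpr> {
        // Recover from a poisoned lock instead of panicking.
        let mut entries = self
            .entries
            .lock()
            .unwrap_or_else(|poisoned| poisoned.into_inner());
        if let Some(hit) = entries.get(expr) {
            return Arc::clone(hit);
        }
        let compiled = Arc::new(compile(expr));
        entries.insert(expr.to_owned(), Arc::clone(&compiled));
        compiled
    }
}
```

Two lookups of the same expression return handles to the same allocation, which `Arc::ptr_eq` can confirm.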

Error Handling

  • NIF safety: Updated empty_binary to handle potential allocation failures gracefully instead of panicking in a BEAM NIF
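
The general pattern of handling an allocation failure with `match` rather than `.unwrap()` can be sketched with the standard library alone. This is not the rustler API: `try_alloc` and `BinaryError` are hypothetical, and what the real `empty_binary` returns on failure is not shown in this PR.

```rust
/// Hypothetical fallible allocator standing in for a NIF binary allocation.
/// Returns None instead of aborting when the request cannot be satisfied.
fn try_alloc(len: usize) -> Option<Vec<u8>> {
    let mut buf = Vec::new();
    buf.try_reserve_exact(len).ok()?;
    buf.resize(len, 0);
    Some(buf)
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum BinaryError {
    AllocFailed,
}

/// Builds an empty buffer without any panicking call on the failure path.
pub fn empty_binary() -> Result<Vec<u8>, BinaryError> {
    match try_alloc(0) {
        Some(buf) => Ok(buf),
        None => Err(BinaryError::AllocFailed),
    }
}
```

The caller decides how to surface `AllocFailed`, which keeps the failure out of the panic path.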

Documentation

  • Enhanced comments explaining XPath 1.0 spec compliance, caching strategy, and memory management rationale

https://claude.ai/code/session_015igpdCrNYKuoPrHWZ5RXYc

claude and others added 3 commits February 16, 2026 18:07
Memory leak fixes:
- StreamingParser: buffer, events, and complete_elements vecs never
  shrank after drain operations, causing unbounded growth in long-lived
  parsers. Added shrink_to() calls matching StreamingSaxParser behavior.
- StructuralIndex: shrink_to_fit() existed but was never called after
  building. Initial capacity estimates over-allocate by 2-3x; now
  reclaimed immediately after build_children_from_parents().
- DocumentAccumulator: reduced default pre-allocation from 64KB to 4KB.
  The old 64KB multiplied quickly across concurrent accumulators.
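
A minimal sketch of shrinking after a drain, assuming a hypothetical `StreamBuffer` wrapper. The 4x factor follows the threshold mentioned in a later commit of this PR; the 4 KiB floor is an assumption made for illustration, and the real parser's fields and limits may differ.

```rust
/// Assumed floor: never shrink below this capacity (illustrative value).
const MIN_CAPACITY: usize = 4 * 1024;
/// Shrink only when capacity exceeds the target by this factor, so that
/// ordinary fluctuations do not cause repeated reallocations.
const SHRINK_FACTOR: usize = 4;

/// Hypothetical stand-in for a long-lived streaming parser buffer.
pub struct StreamBuffer {
    buf: Vec<u8>,
}

impl StreamBuffer {
    pub fn new() -> Self {
        Self { buf: Vec::new() }
    }

    pub fn feed(&mut self, chunk: &[u8]) {
        self.buf.extend_from_slice(chunk);
    }

    /// Drops the first `n` bytes, then releases excess capacity if the
    /// allocation is far larger than what is still needed.
    pub fn consume(&mut self, n: usize) {
        let n = n.min(self.buf.len());
        self.buf.drain(..n);
        let target = self.buf.len().max(MIN_CAPACITY);
        if self.buf.capacity() > target.saturating_mul(SHRINK_FACTOR) {
            self.buf.shrink_to(target);
        }
    }

    pub fn len(&self) -> usize {
        self.buf.len()
    }

    pub fn capacity(&self) -> usize {
        self.buf.capacity()
    }
}
```

Without the shrink step, one large chunk would pin the buffer at its peak capacity for the rest of the parser's lifetime.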

Connection/contention fix:
- XPath cache: replaced deep CompiledExpr cloning with Arc<CompiledExpr>.
  Every cache hit previously cloned all Vec<Op>, Strings, and
  Box<CompiledExpr> recursively. Now it's a cheap Arc pointer bump.

Correctness + memory fix:
- compare_values in XPath eval: was using format!("{}", node_id) to
  compare nodes, which compared raw u32 IDs as strings — both wrong
  per XPath 1.0 spec and wasteful (O(n*m) String allocations). Now
  uses actual document text content via node_string_value().
- Consolidated duplicated get_node_text_content/collect_text_content
  from lib.rs into shared dom::node_string_value().

Safety fix:
- empty_binary: replaced .unwrap() on OwnedBinary::new(0) with match
  to avoid potential NIF panic.

Best practices:
- Added #[must_use] to evaluate(), evaluate_from_node(), compile(),
  validate_strict(), and XPathValue type.

https://claude.ai/code/session_015igpdCrNYKuoPrHWZ5RXYc
…e all clippy warnings

- Fix XPathValue::to_string_value() to return empty string for NodeSets
  instead of misleading format!("[node:{}]", id) — callers with document
  access now use dom::node_string_value() per XPath 1.0 spec
- Add resolve_string() helper to XPath functions for proper NodeSet-to-string
  conversion; pass document access to all 8 string functions
- Replace duplicated get_string_value/collect_text with dom::node_string_value
- Remove residual get_node_text_content wrapper from NIF layer
- Optimize NodeSet-vs-NodeSet comparison from O(n*m) to O(n+m) by
  pre-computing right-side string values
- Fix streaming parser shrink_to() reallocation churn with 4x threshold
- Remove redundant #[must_use] on compile() (Result already has it)
- Remove dead inherent methods on XmlDocument duplicated by DocumentAccess trait
- Remove unused XmlAttribute re-export from dom/mod.rs
- Fix len_zero clippy warnings in unified_scanner tests
- Add #[expect(clippy::ptr_arg)] to intern_cow (needs Cow for zero-copy optimization)
- Add 4 correctness tests for NodeSet equality and string function semantics
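
The O(n+m) NodeSet comparison can be sketched as below. Under XPath 1.0, two node-sets are equal if some node in each has the same string-value. The function name and the `string_value` callback are hypothetical; the crate's actual code takes document access in a form not shown here.

```rust
use std::collections::HashSet;

/// Node-set equality in O(n + m) string-value lookups: compute the right
/// side's string-values once, then probe the set for each left-side node.
pub fn node_sets_equal<F>(left: &[u32], right: &[u32], string_value: F) -> bool
where
    F: Fn(u32) -> String,
{
    let right_values: HashSet<String> =
        right.iter().map(|&id| string_value(id)).collect();
    left.iter()
        .any(|&id| right_values.contains(&string_value(id)))
}
```

The naive version recomputes the right side's string-values for every left-side node; hashing them once removes the nested loop.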

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace expect() in XPath parser peek() with match + defensive Eof
  fallback — eliminates BEAM VM crash risk if invariant is violated
- Replace expect() in XPath lexer read_string() with unwrap_or
  defensive fallback — same rationale
- Thread document access into compare_numbers() so relational operators
  (< <= > >=) properly resolve NodeSet text content before numeric
  conversion — fixes silent wrong results for expressions like
  /r/price > 10 where <price>42.5</price>
- Add resolve_number() helper mirroring resolve_string() pattern
- Add test for relational operators on NodeSets
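
A sketch of the relational fix, assuming hypothetical names (`Value`, `xpath_number`, `greater_than`) and a `string_value` callback in place of the crate's document access. Each node is resolved to its text content, converted with XPath 1.0 `number()` rules, and only then compared.

```rust
/// Simplified value type for this sketch (the real XPathValue has more variants).
pub enum Value {
    Number(f64),
    NodeSet(Vec<u32>),
}

/// XPath 1.0 number() on a string: optional whitespace, optional '-',
/// digits with at most one '.', optional whitespace; anything else is NaN.
pub fn xpath_number(s: &str) -> f64 {
    let t = s.trim_matches(|c| matches!(c, ' ' | '\t' | '\r' | '\n'));
    let digits = t.strip_prefix('-').unwrap_or(t);
    let valid = digits.chars().any(|c| c.is_ascii_digit())
        && digits.chars().all(|c| c.is_ascii_digit() || c == '.')
        && digits.chars().filter(|&c| c == '.').count() <= 1;
    if valid {
        t.parse().unwrap_or(f64::NAN)
    } else {
        f64::NAN
    }
}

/// Resolves a value to the numbers it stands for: a number is itself, and
/// a node-set yields one number per node, taken from the node's text.
fn resolve_numbers<F: Fn(u32) -> String>(v: &Value, string_value: &F) -> Vec<f64> {
    match v {
        Value::Number(n) => vec![*n],
        Value::NodeSet(ids) => ids
            .iter()
            .map(|&id| xpath_number(&string_value(id)))
            .collect(),
    }
}

/// `left > right`: true if any pair of resolved numbers satisfies it.
/// Comparisons involving NaN are false, as XPath 1.0 requires.
pub fn greater_than<F: Fn(u32) -> String>(left: &Value, right: &Value, string_value: F) -> bool {
    let ls = resolve_numbers(left, &string_value);
    let rs = resolve_numbers(right, &string_value);
    ls.iter().any(|a| rs.iter().any(|b| a > b))
}
```

For a node whose text is `42.5`, comparing against `10` now uses 42.5 rather than anything derived from the node's ID.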

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jeffhuen jeffhuen merged commit f33c37a into main Feb 16, 2026
@jeffhuen jeffhuen deleted the claude/fix-memory-leaks-rust-bCi0M branch February 16, 2026 21:00